8 Linear and multiple regression
The following introduction is based on Heumann et al. (2022: Chapter 11), James et al. (2021: Chapter 3), Levshina (2015: Chapter 7) and Winter (2020: Chapter 4).
8.1 Preparation
# Load libraries
library("readxl")
library("writexl")
library("tidyverse")
# Load file
ELP <- read_xlsx("ELP.xlsx")
# Inspect file structure
str(ELP)
8.2 Introduction
Consider the distribution of the continuous variable RT (reaction times) from the ELP (English Lexicon Project) dataset. We will \(log\)-transform the reaction times to even out the differences between extremely high and extremely low values (cf. Winter 2020: 90-94).
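The plotting code behind the histograms is not shown here; the following is a minimal sketch of such a histogram, using simulated right-skewed reaction times (the `rlnorm()` parameters are chosen only to roughly mimic the scale of the ELP values, not taken from the data):

```r
# Sketch: histogram of log-transformed reaction times
# (simulated, right-skewed RTs standing in for the ELP data)
library(ggplot2)
set.seed(123)
d <- data.frame(RT = rlnorm(500, meanlog = 6.6, sdlog = 0.15))
# On the log scale, the long right tail of the raw RTs is compressed
p <- ggplot(d, aes(x = log(RT))) +
  geom_histogram(binwidth = 0.05)
p
```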
We are investigating the relationship between reaction times RT and the frequency Freq of a lexical stimulus.
Some open questions:
Can word frequency help us explain variation in reaction times?
If it can, then how could we characterise the effect of word frequency? In other words, does it increase or decrease reaction times?
What reaction times should we expect for new observations?
8.2.1 Building a statistical model
Here,
RT is a response or target that we wish to explain. We generically refer to the response as \(Y\).
Freq is a feature, input, or predictor, which we name \(X\).
We can thus summarise our preliminary and fairly general statistical model as
\[ Y = f(X) + \epsilon. \] While the term \(f(X)\) denotes the contribution of \(X\) to the explanation of \(Y\), \(\epsilon\) describes the errors of the model.
8.3 Linear Regression
Linear regression is a simple approach to supervised machine learning where the response variable is known. It assumes that the dependence of \(Y\) on \(X\) is linear.
This approach is suited for numerical response variables. The predictors can be either numerical or discrete/categorical.
Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically.
8.3.1 Model with a single predictor \(X\)
- A linear model of our data would have the form
\[ Y = \beta_0 + \beta_1X + \epsilon \]
or, in more concrete terms,
\[ \text{Reaction time} = \beta_0 + \beta_1\text{Frequency} + \text{Model Error,} \]
where \(\beta_0\) and \(\beta_1\) are two unknown constants that represent the intercept and slope, respectively. Together they are referred to as the model coefficients (or parameters), and \(\epsilon\) is the error term. The fact that assumptions are made about the form of the model renders it a parametric model.
Given the existing data on \(X\) and \(Y\), which is also known as the training data, we estimate the model coefficients \(\beta_0\) and \(\beta_1\). While we use the training data to estimate these coefficients, we cannot know the true values of the coefficients, i.e., the exact true relationship between the variables.
The most common way of estimating parameters for linear models is the least squares approach. In essence, the parameters are chosen such that the residual sum of squares1, i.e., the sum of the squared differences between observed and predicted values, is as low as possible.
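For a single predictor, the least-squares estimates have a closed-form solution, which can be checked against lm() on a small simulated dataset (the toy values below are invented for illustration, not taken from the ELP data):

```r
# Sketch: least-squares estimates computed by hand on toy data
set.seed(1)
x <- 1:10
y <- 2 + 0.5 * x + rnorm(10, sd = 0.2)
# Closed-form least-squares solutions for slope and intercept
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
# lm() arrives at the same estimates
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))  # TRUE
```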
We can then predict future reaction times using the formula
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1x, \]
where \(\hat{y}\) indicates a prediction of \(Y\) on the basis of the predictor values \(X = x\). The hat symbol ^ marks estimated values.
- In R, we can fit a linear model with the lm() function.
# Fit linear model
rt.lm1 <- lm(log(RT) ~ log(Freq), data = ELP)
# View model data
summary(rt.lm1)
Call:
lm(formula = log(RT) ~ log(Freq), data = ELP)
Residuals:
Min 1Q Median 3Q Max
-0.29765 -0.08203 -0.01205 0.07298 0.43407
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.633361 0.004286 1547.82 <2e-16 ***
log(Freq) -0.048602 0.002201 -22.08 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1235 on 878 degrees of freedom
Multiple R-squared: 0.357, Adjusted R-squared: 0.3563
F-statistic: 487.5 on 1 and 878 DF, p-value: < 2.2e-16
The model statistics comprise the following elements:
Call, i.e., the model formula.
Residuals: These indicate the difference between the observed values in the data set and the values predicted by the model (= the fitted values). These correspond to the error term \(\epsilon\). The lower the residuals, the better the model describes the data.
# Show fitted values (= predictions) for the first six observations
head(rt.lm1$fitted.values)
       1        2        3        4        5        6 
6.635345 6.563152 6.789806 6.613980 6.630529 6.574894
# Show deviation of the fitted values from the observed values
head(rt.lm1$residuals)
          1           2           3           4           5           6 
0.03778825 -0.02277165 0.07759556 0.03387713 0.15222949 -0.10432670
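The relationship between residuals and fitted values can be verified directly: each residual is the observed value minus the fitted value. A minimal sketch on toy data (invented values, not the ELP data):

```r
# Sketch: residuals are observed minus fitted values
x <- c(1, 2, 3, 4)
y <- c(2.1, 3.9, 6.2, 7.8)
fit <- lm(y ~ x)
# Computing the residuals by hand ...
resid_manual <- y - fitted(fit)
# ... matches the residuals stored in the model object
all.equal(unname(resid_manual), unname(residuals(fit)))  # TRUE
```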
- Coefficients: The regression coefficients correspond to \(\hat{\beta}_0\) (“Intercept”) and \(\hat{\beta}_1\) (“log(Freq)”), respectively. The model shows that for a one-unit increase in log-frequency the log-reaction time decreases by approx. 0.05.
# Convert coefficients to a tibble
library("broom")
tidy_model <- tidy(rt.lm1)
tidy_model
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 6.63 0.00429 1548. 0
2 log(Freq) -0.0486 0.00220 -22.1 2.88e-86
\(p\)-values and \(t\)-statistic: Given the null hypothesis \(H_0\) that there is no correlation between log(RT) and log(Freq) (i.e., \(H_0: \beta_1 = 0\)), a \(p\)-value lower than 0.05 indicates that \(\beta_1\) deviates considerably from 0, thus providing evidence for the alternative hypothesis \(H_1: \beta_1 \ne 0\). Since \(p < 0.001\), we can reject \(H_0\). The \(p\)-value itself crucially depends on the \(t\)-statistic2, which measures “the number of standard deviations that \(\hat{\beta}_1\) is away from 0” (James et al. 2021: 67). The standard error (SE) reflects how much an estimated coefficient differs on average from the true parameter value. Standard errors can be used to compute the 95% confidence interval \([\hat{\beta}_1 - 2 \cdot SE(\hat{\beta}_1), \hat{\beta}_1 + 2 \cdot SE(\hat{\beta}_1)]\); for approximately 95% of samples, an interval constructed in this way will contain the true value of \(\beta_1\).
# Compute confidence intervals for intercept and log(Freq)
tidy_model_ci <- tidy(rt.lm1, conf.int = TRUE)
tidy_model_ci
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 6.63 0.00429 1548. 0 6.62 6.64
2 log(Freq) -0.0486 0.00220 -22.1 2.88e-86 -0.0529 -0.0443
The estimated parameter for log(Freq), which is -0.049, thus has the 95% confidence interval [-0.053, -0.044].
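This interval can also be reconstructed by hand from the estimate and standard error; a minimal sketch (the two numbers are copied from the summary output above, and the exact \(t\)-quantile is used in place of the rule-of-thumb factor 2):

```r
# Sketch: 95% CI for log(Freq) from estimate and standard error
est <- -0.048602
se  <- 0.002201
# 97.5% quantile of the t-distribution with 878 residual df
t_crit <- qt(0.975, df = 878)
ci <- round(est + c(-1, 1) * t_crit * se, 3)
ci  # -0.053 -0.044
```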
The residual standard error (RSE)3 estimates the standard deviation of the error term \(\epsilon\), i.e., the average amount by which the observed values deviate from the regression line.
\(R^2\)4 is important for assessing model fit because it “measures the proportion of variability in \(Y\) that can be explained using \(X\)” [James et al. (2021): 70; emphasis removed], varying between 0 and 1.
The \(F\)-statistic is used to measure the association between the dependent variable and the independent variable(s). Generally speaking, values greater than 1 indicate a possible correlation. A very low \(p\)-value suggests that the null hypothesis \(H_0: \beta_1 = 0\) can be rejected.
8.3.2 Multiple linear regression
In multiple linear regression, more than one predictor variable is taken into account. For instance, modelling log(RT) as a function of log(Freq), POS and Length requires a more complex model of the form
\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p + \epsilon.\]
Predictions are then obtained via the formula
\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2 + ... + \hat{\beta}_px_p. \]
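Applying this prediction formula by hand yields the same result as R's predict() function. A minimal sketch on simulated data whose variable names merely mimic the structure of the ELP variables (they are not the ELP data):

```r
# Sketch: predictions from a multiple regression model
set.seed(42)
d <- data.frame(
  logfreq = runif(100, 0, 5),
  length  = sample(3:10, 100, replace = TRUE)
)
d$logrt <- 6.5 - 0.04 * d$logfreq + 0.02 * d$length + rnorm(100, sd = 0.1)
m <- lm(logrt ~ logfreq + length, data = d)
# Prediction for a new observation via predict() ...
new_obs <- data.frame(logfreq = 2, length = 6)
pred <- predict(m, newdata = new_obs)
# ... equals the formula applied by hand: b0 + b1*2 + b2*6
manual <- sum(coef(m) * c(1, 2, 6))
all.equal(unname(pred), manual)  # TRUE
```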
In R, a multiple regression model is fitted as in the code example below:
# Fit multiple regression model
rt.lm2 <- lm(log(RT) ~ log(Freq) + POS + Length, data = ELP)
# View model statistics
summary(rt.lm2)
Call:
lm(formula = log(RT) ~ log(Freq) + POS + Length, data = ELP)
Residuals:
Min 1Q Median 3Q Max
-0.26955 -0.07853 -0.00672 0.07067 0.39528
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.459742 0.016946 381.205 < 2e-16 ***
log(Freq) -0.038071 0.002130 -17.874 < 2e-16 ***
POSNN -0.006242 0.010157 -0.615 0.53902
POSVB -0.035234 0.012125 -2.906 0.00375 **
Length 0.023094 0.001711 13.495 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1114 on 875 degrees of freedom
Multiple R-squared: 0.4784, Adjusted R-squared: 0.476
F-statistic: 200.6 on 4 and 875 DF, p-value: < 2.2e-16
8.4 Visualising regression models
Plot coefficient estimates:
# Tidy the model output
tidy_model <- tidy(rt.lm2, conf.int = TRUE)
# Remove intercept
tidy_model <- tidy_model %>% filter(term != "(Intercept)")
# Create the coefficient plot
ggplot(tidy_model, aes(x = estimate, y = term)) +
geom_point() +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
geom_vline(xintercept = 0, linetype = "dashed", color = "steelblue") +
theme_minimal() +
labs(
x = "Coefficient Estimate",
y = "Predictor",
title = "Coefficient Estimates with Confidence Intervals"
)
Plot contributions of individual variable values:
# Plot marginal effects
library("effects")
Loading required package: carData
lattice theme set by effectsTheme()
See ?effectsTheme for details.
plot(Effect("Freq", mod = rt.lm2))
plot(Effect("POS", mod = rt.lm2))
plot(Effect("Length", mod = rt.lm2))
8.5 Model assumptions and diagnostics
As a parametric method, linear regression makes numerous assumptions about the training data. It is, therefore, essential to run further tests to rule out possible violations. Among other things, the model assumptions include:
A linear relationship between the response and the quantitative predictors: The residuals should not display a clear pattern. For this reason, it is recommended to use component residual plots (e.g., crPlot() from the car library) for the visual identification of potentially non-linear trends.
No heteroscedasticity (i.e., non-constant variance of error terms): Visually, a violation of this assumption becomes apparent if the residuals form a funnel-like shape. It is also possible to conduct a non-constant variance test with ncvTest(): a \(p\)-value < 0.05 suggests non-constant variance.
No multicollinearity: Predictors should not be correlated with each other. In the model data, correlated variables have unusually high standard errors, thereby decreasing the explanatory power of both the coefficients and the model as a whole. Another diagnostic measure is the variance inflation factor (VIF); predictors with VIF scores > 5 are potentially collinear. VIF scores can be computed with the vif() function.
Normally distributed residuals: The residuals should follow the normal distribution. Usually, a visual inspection using qqnorm() is sufficient, but the Shapiro-Wilk test shapiro.test() can also be run on the model residuals. Note that a \(p\)-value below 0.05 provides evidence for non-normality.
Besides the points mentioned above, it is always recommended to examine the model with regard to
outliers that might skew the regression estimates,
interactions, i.e., combined effects of predictors, and
overfitting, which results in poor model performance outside the training data.
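A few of these checks can be sketched on a small simulated model using base R only; the VIF is computed by hand here, while the car package's ncvTest(), vif() and crPlot() provide convenient wrappers for these and further diagnostics (the toy data below are invented for illustration):

```r
# Sketch: selected model diagnostics on a toy model
set.seed(7)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 0.5 * d$x1 - 0.3 * d$x2 + rnorm(100, sd = 0.5)
m <- lm(y ~ x1 + x2, data = d)
# Normality of residuals: visual check plus Shapiro-Wilk test
qqnorm(residuals(m)); qqline(residuals(m))
shapiro.test(residuals(m))
# VIF for x1, computed by hand: 1 / (1 - R^2 of x1 regressed on x2)
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2, data = d))$r.squared)
vif_x1  # close to 1 here, so no collinearity concern
```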